provide a convenient way to normalize (render dimensionless) a random variable,
namely
$$ \mathbf{X}^{*} = \frac{\mathbf{X} - \mu_X}{\sigma_X} . \tag{9.35} $$
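As an illustration of (9.35), the following minimal sketch standardizes a small sample. The sample values and the function name `standardize` are made up for the example, and the population standard deviation (division by n) is assumed.

```python
from statistics import fmean, pstdev


def standardize(xs):
    """Return the dimensionless values (x - mean) / sd, following (9.35)."""
    mu = fmean(xs)
    sigma = pstdev(xs)  # population standard deviation (divides by n): an assumption
    if sigma == 0:
        raise ValueError("standard deviation is zero; the variable cannot be normalized")
    return [(x - mu) / sigma for x in xs]


# Hypothetical sample, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(sample)
print(z)
```

The standardized values have zero mean and unit standard deviation, whatever the units of the original sample.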
The covariance measures the linear association between variables $\mathbf{X}$ and $\mathbf{Y}$ and is
defined as
$$ \mathrm{Cov}(\mathbf{X}, \mathbf{Y}) = \mathbf{E}\big[(\mathbf{X} - \mathbf{E}(\mathbf{X}))(\mathbf{Y} - \mathbf{E}(\mathbf{Y}))\big] = \mathbf{E}(\mathbf{XY}) - \mathbf{E}(\mathbf{X})\mathbf{E}(\mathbf{Y}) \tag{9.36} $$
explicitly, as
$$ \mathrm{Cov}(\mathbf{X}, \mathbf{Y}) = \frac{1}{n} \sum_{j=1}^{n} (x_j - \mu_X)(y_j - \mu_Y) . \tag{9.37} $$
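The two forms of the covariance can be checked numerically. The sketch below is illustrative only: the data and function names are hypothetical, and the means are estimated from the same sample, with division by n as in (9.37).

```python
from statistics import fmean


def covariance(xs, ys):
    """Covariance as the mean product of deviations, as in (9.37)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def covariance_from_moments(xs, ys):
    """Covariance as E(XY) - E(X)E(Y), the second form in (9.36)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return fmean([x * y for x, y in zip(xs, ys)]) - fmean(xs) * fmean(ys)


# Hypothetical paired observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 9.0]
print(covariance(xs, ys), covariance_from_moments(xs, ys))
```

Both functions give the same value up to rounding; the deviation form is usually the numerically safer of the two when the means are large compared with the spread.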
The covariance equals zero if the variables are independent; variables whose
covariance is zero are called uncorrelated. The correlation coefficient
$\rho(\mathbf{X}, \mathbf{Y})$ is a normalized covariance:
$$ \rho(\mathbf{X}, \mathbf{Y}) = \frac{\mathrm{Cov}(\mathbf{X}, \mathbf{Y})}{\sigma_X \sigma_Y} . \tag{9.38} $$
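A minimal sketch of (9.38) follows, with hypothetical data and function names. The second example is included because zero correlation does not by itself rule out dependence: there y is completely determined by x, yet the coefficient vanishes.

```python
from math import sqrt
from statistics import fmean


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    mx, my = fmean(xs), fmean(ys)
    cov = fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    sx = sqrt(fmean([(x - mx) ** 2 for x in xs]))
    sy = sqrt(fmean([(y - my) ** 2 for y in ys]))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sx * sy)


# Hypothetical data, symmetric about zero.
points = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [3.0 * x + 1.0 for x in points]  # exact linear dependence
squares = [x * x for x in points]         # exact but nonlinear dependence

rho_linear = correlation(points, linear)
rho_squares = correlation(points, squares)
print(rho_linear, rho_squares)
```

For the linear case the coefficient is 1 up to rounding; for the squares it is 0, because the positive and negative deviations of x cancel exactly.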
The correlation coefficient is connected with the linear dependence of $\mathbf{X}$ and $\mathbf{Y}$, but can be
zero even if $\mathbf{Y}$ is a function of $\mathbf{X}$. If more than two variables are involved, it is
convenient to arrange the pairwise covariances in the so-called covariance matrix.
The scatter matrix $S$ of $n$ samples of $m$-dimensional data is defined as
$$ S = \sum_{j=1}^{n} (\mathbf{X}_j - \mathbf{E}(\mathbf{X}))(\mathbf{X}_j - \mathbf{E}(\mathbf{X}))^{\mathrm{T}} . \tag{9.39} $$
If the variables are normally distributed, the (normalized) scatter matrix provides an
estimate of the covariance matrix.
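The scatter matrix of (9.39) can be sketched in plain Python as below. Two assumptions are made here that the text does not spell out: the expectation E(X) is replaced by the sample mean vector, and "normalized" is taken to mean division by n (division by n - 1 is another common convention). The data and names are hypothetical.

```python
def scatter_matrix(samples):
    """Sum of outer products of deviations from the sample mean vector."""
    n = len(samples)
    m = len(samples[0])
    # Sample mean of each of the m coordinates (stands in for E(X)).
    mean = [sum(row[k] for row in samples) / n for k in range(m)]
    s = [[0.0] * m for _ in range(m)]
    for row in samples:
        d = [row[k] - mean[k] for k in range(m)]
        for a in range(m):
            for b in range(m):
                s[a][b] += d[a] * d[b]
    return s


# Hypothetical data: n = 4 samples of m = 2 dimensions.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 9.0)]
S = scatter_matrix(data)

# Normalizing by n (an assumption) gives a covariance matrix estimate.
cov_estimate = [[entry / len(data) for entry in row] for row in S]
print(S)
print(cov_estimate)
```

The diagonal of the normalized matrix holds the variances of the individual coordinates, and the off-diagonal entries hold the pairwise covariances.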
Problem. Calculate the means and variances of the binomial and Poisson
distributions.
9.3.1 Runs
Studies of the statistical properties of DNA and the like often start by stating the
total number of each of the four bases A, C, T, and G. These counts entirely neglect
the order in which the bases occur. The theory of the distribution of runs
is one way of handling this information. A run is defined as a succession of similar
events preceded and succeeded by different events; the number of elements in a run
will be referred to as its length. For a linear sequence, the number of runs equals
the number of unlike neighbouring pairs plus one.
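The definition of a run can be made concrete with a short sketch. The sequence below is made up for illustration, and the function name `find_runs` is hypothetical; the ends of the sequence are treated as run boundaries.

```python
from itertools import groupby


def find_runs(sequence):
    """Return (symbol, length) for each maximal stretch of identical symbols."""
    return [(symbol, sum(1 for _ in group)) for symbol, group in groupby(sequence)]


# Hypothetical DNA fragment.
seq = "AACGGGTTA"
runs = find_runs(seq)

# Adjacent positions that hold different symbols.
unlike_pairs = sum(a != b for a, b in zip(seq, seq[1:]))

print(runs)
print(len(runs), unlike_pairs)
```

With this convention, a linear sequence has one more run than it has unlike adjacent pairs, since each unlike pair marks the boundary between two consecutive runs.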